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Abstract 

Background: The synthesis of information across microarray studies has been performed by combining statistical 
results of individual studies (as in a mosaic), or by combining data from multiple studies into a large pool to be 
analyzed as a single data set (as in a melting pot of data). Specific issues relating to data heterogeneity across 
microarray studies, such as differences within and between labs or differences among experimental conditions, could 
lead to equivocal results in a melting pot approach. 

Results: We applied statistical theory to determine the specific effect of different means and heteroskedasticity 
across 1 9 groups of microarray data on the sign and magnitude of gene-to-gene Pearson correlation coefficients 
obtained from the pool of 1 9 groups. We quantified the biases of the pooled coefficients and compared them to the 
biases of correlations estimated by an effect-size model. Mean differences across the 19 groups were the main factor 
determining the magnitude and sign of the pooled coefficients, which showed largest values of bias as they 
approached ±1 . Only heteroskedasticity across the pool of 1 9 groups resulted in less efficient estimations of 
correlations than did a classical meta-analysis approach of combining correlation coefficients. These results were 
corroborated by simulation studies involving either mean differences or heteroskedasticity across a pool of N > 2 
groups. 

Conclusions: The combination of statistical results is best suited for synthesizing the correlation between expression 
profiles of a gene pair across several microarray studies. 



Background 

There is a wealth of information enclosed in the mas- 
sive amount of microarray data so far accumulated in 
public repositories. The variety of data sets generated 
from the assortment of experiments is a major obstacle in 
the path leading from these data to information. Specific 
issues relating to data heterogeneity across microarray 
studies include differences across platforms, differences 
within and between labs, and/or differences among exper- 
imental factors such as treatments and tissues [1,2]. Fur- 
thermore, concerns regarding integration of studies from 
multiple sources in general, such as variations in design, 
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research goals, or quality of implementation, add to these 
issues [3,4]. 

Inappropriate integration of microarray data from pub- 
lic repositories could lead to equivocal results [5]. The 
"Simpsons paradox" [6], which refers to contradictory sta- 
tistical results obtained when analysis is performed within 
versus across groups of data [7], is an example of mis- 
handling of data. Blyth [8] gives an example involving 
the analysis of 2x2 contingency tables across two groups, 
and Hassler and Thadewald [9] also illustrate Simpsons 
paradox when correlation coefficients are estimated from 
a pool of two groups versus within each group. In both 
cases, the paradox can be explained as results are further 
investigated in light of the specific statistical properties 
of each group of data. The "ecological fallacy" happens 
when the correlation of aggregated variables results in 
a significant relationship that is due only to aggrega- 
tion rather than to any real association [10] (p. 285). An 
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early example of an ecological fallacy can be found in 
Gehlke and Biehl [11], whose study of grouping effects 
in census tract data showed that the magnitude of cor- 
relation coefficients of two variables tend to increase 
as the level of census tract aggregation increases. This 
problem was later referred to as the "modifiable areal 
unit problem" and further studied by Openshaw and 
Taylor [12]. 

Combining statistical results (e.g., parameter estimates, 
p-values) of independent studies that address similar 
questions has been a standard procedure in classic meta- 
analyses [4,13]. This approach entails analyzing each data 
set independently and then combining the results, as 
in a mosaic. Meta-analysis of microarray data has been 
applied in a broader context, as some works include 
data spanning a wide range of purposes and designs. 
Parmigiani et al.[14], in a quest for a common gene 
signature across multiple cancer types, developed a sta- 
tistical method to identify and assess the intersection 
of multiple gene expression signatures across 40 pub- 
lished cancer-related microarray studies. On the other 
hand, Wirapati et al [15] and Rhodes et al. [16] devel- 
oped specific meta-analysis methods to integrate gene 
expression signatures of breast and lung cancer, respec- 
tively, across independent studies of microarray data. 
Hu et al. [17] and Borozan et al. [18] proposed meth- 
ods that extend traditional effect-size models to combine 
information from different microarray studies as a way 
to evaluate or unify lists of genes differentially expressed 
across them. 

Another approach combines data from multiple 
microarray studies (termed "pooled data") in a melting 
pot of data and analyzes them as a single data set. Kim and 
Webster [19] used public databases containing microarray 
data and biological traits on cytoarchitectural abnormali- 
ties from the same samples of patients belonging to three 
groups of major mental disorders plus a control group. 
Their study used gene expression data measured through 
two array types, the Affymetrix Human Genome U133 
Set A and the Affymetrix Human Genome 95av2, and the 
authors carried out a correlation analysis between each 
gene expression and the biological traits of each subject; 
although not fully described in the paper, it seems the 
correlation analysis was performed on the pooled data 
set from independent studies. Subsequent gene ontol- 
ogy (GO) [20] enrichment analysis revealed significant 
overrepresentation of biological processes, such as cel- 
lular metabolism, central nervous system development, 
cell motility, and programmed cell death, in groups of 
genes that were significantly correlated with biological 
traits. Mentzen and Wurtele [21] and Horan et al. [22] 
have created co-expression networks for Arabidopsis 
thaliana based on parameters of co-expression similar- 
ity that were estimated from a large pool of microarray 



data downloaded from public repositories. Mentzen and 
Wurtele [21] pooled data from 963 Affymetrix gene 
chips, distributed across 71 independent studies encom- 
passing diverse organs, conditions, and genotypes "to 
quest the transcriptome in response to a wide variety 
of environmentally, genetically, and developmentally 
induced perturbations". Horan et al. [22] pooled data 
corresponding to 1310 Affymetrix microarrays divided 
among 41 independent studies. Both works used cluster 
analysis based on Pearson correlation coefficients as a 
measure of similarity of gene expression profiles from 
Arabidopsis. Mentzen and Wurtele [21] analyzed data 
from 21,000 gene probes on the gene chip and identi- 
fied clusters of co-expressed genes as regulons. Horan 
et al. [22] used clusters to identify groups of co-regulated 
protein of unknown function and protein of known func- 
tion encoding genes from Arabidopsis. In both works, 
GO enrichment analysis showed that networks based on 
gene-to-gene correlations estimated from pooling data 
from multiple microarray studies were not random. A 
similar approach has been used to obtain regulon infor- 
mation from a human transcriptomic network derived 
from almost 20,000 microarrays [23]. This analysis also 
showed a non random functional distribution of regulons. 

From a statistical standpoint, combining data from inde- 
pendent microarray studies into a large pool as a single 
set can be acceptable if data homogeneity can be ensured 
across studies. Yet, this condition is nearly impossible 
to ensure considering that significant data heterogene- 
ity is reported even for completely replicated microarray 
experiments carried out by the same lab [1]. Never- 
theless, it can be argued that GO enrichment implies 
meaningful biology and significant GO enrichment has 
been shown for networks created from pooled data 
[19,21-23]. Moreover, information gathered through a sin- 
gle data set analysis has led to gene function knowledge 
discovery [24]. 

The objective of this study was to perform a comprehen- 
sive analysis of Pearson correlation coefficients estimated 
from pooling heterogeneous groups of data (melting pot 
approach) in a large-scale gene expression analysis of pub- 
licly available Affymetrix microarrays and compare it to 
the analysis (of the same data) that combines statistical 
results of individual groups (mosaic approach). Our study 
included two specific objectives: (a) to determine the 
specific effect of different means and heteroskedasticity 
across the many groups comprising a pool of microarray 
data on the sign and magnitude of gene-to-gene Pearson 
correlation coefficients obtained from the pool of data, 
and (b) to quantify the extent of bias in gene-to-gene 
Pearson correlation coefficients obtained from a pool of 
heterogeneous groups of microarray data. 

In the "Methods" section of this article, we describe 
the statistical theory that we applied to analyze the 
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components of Pearson correlation coefficients obtained 
from a pool of heterogeneous microarray groups. The 
"Simulation study" section provides results of a study 
that further tests the specific effect on Pearson corre- 
lation coefficients of only mean differences, and only 
heteroskedasticity across N > 2 groups, and the valid- 
ity of our methodology when groups have a small number 
of elements. In the section "Application to experimen- 
tal microarray data" we illustrate the results predicted by 
both theory and simulation with data from 10 microar- 
ray experiments. At the end of this section, we provide 
an assessment of the bias of correlation coefficients esti- 
mated across a pool of heterogeneous groups of microar- 
ray data. We discuss our results and summarize our 
conclusions in the last section. 



Methods 

Dissecting components of the Pearson correlation 
coefficient obtained from a pool of microarray data 

Hassler and Thadewald [9] developed the asymptotic for- 
mulation to quantify and explain differences between the 
Pearson correlation coefficient estimated from combining 
two heterogeneous groups into one pool and the Pear- 
son correlation coefficients estimated within each group. 
They illustrated the problem with a set of measurements 
on height (cm) and weight (kg) reported by 184 first-year 
college students with a roughly even number of males 
and females. As the authors emphasized in their work, 
male and female groups were not homogeneous because 
"male students are taller and heavier than female students, 
and variation around the mean also differs between the 
groups". 

The generalization of Hassler and Thadewalds [9] 
asymptotic analysis for N heterogeneous groups (refer to 
their original work for specifics about the asymptotic anal- 
ysis) is provided in Equation 1. For the purpose of applying 
their theoretical work to analyze correlation coefficients 
obtained from a pool of microarray data, we consider 
N heterogeneous groups of gene expression data mea- 
sured through microarrays. Each group of data can be 
described as a matrix Mi of g genes by Yi{ columns (each 
column of the matrix M{ corresponds to the expression of 
g genes measured through one microarray). We assume 
that expression levels of any given gene pair xy within 
each group, i.e. x,y e Mu are bivariate random nor- 
mal variables that are identically distributed with means 
fx X y fi = (/ji X) i, /Jiyj) and variance-covariance matrix E^/ = 

x ,i xy,i ) y; = in Therefore, heterogeneities across 

J xy,i °y,i) 

N groups of microarray data are characterized by fi X y f i ^ 
and/or E^- ^ for i ^ j. 
The limit in probability of the Pearson correlation 
coefficient between expressions of genes x and y, r xy , 



obtained from a pool of N heterogeneous groups, as ni 
oo, is given by the expression in Equation 1: 



i xy 



X^'=l ^i G xy,i + Si=l S/=/+l ^ifyiflxj ^x,j)(f^y,i l^y,j) 



8 x 8y 



(1) 



where A/ = 



represents the weight of the num- 



ber of microarrays of each group and the terms 8% and 
8y correspond to the average of expression level vari- 
ances weighted by Yi{ plus the weighted average of the 
square of mean differences across N groups for genes x 
and y, respectively (Equations 2 and 3). Hence, the limit 
in probability of gene-to-gene Pearson correlation coeffi- 
cients obtained from combining heterogeneous groups of 
microarray data is a mixture of the weighted average of all 
covariances across N groups plus the weighted average of 
the cross product of the mean differences of genes x and y 
across N groups. Both terms are divided by a combination 
of the average of variances of genes x and y and the mean 
differences of genes x and y across N groups. 



N 



N N 



i=l j=i+l 



N N N 

i—1 i—1 j—i+1 



V>y,j) 



(2) 



(3) 



Results 

Simulation study 

This section presents results of a study using simulated 
data that had the purpose of further investigating correla- 
tion coefficients obtained from a pool of N groups under 
the following specific conditions: (a) occurrence of only 
mean differences across N > 2 groups and (b) occur- 
rence of only heteroskedasticity across N > 2 groups. Our 
study also evaluated the validity of using estimates of the 
asymptotic expression given in Equation 1 to explain the 
components of the pooled correlation coefficient when 
groups comprising the pool of data have a small number 
of elements Yi{. 

We performed simulations in R [25] for a generic gene 
pair xy following the procedure detailed in 1-4: 

1. For each simulated group i, i = 1,N, generate n{ data 
pairs from a multivariate normal distribution with 
parameters \i xy ^ and ^ X y,i (we used the function 
mvrnorm in MASS [26]). Each simulated group is a 
matrix of 2 rows (genes) by Yi{ columns (number of 
elements ni). 
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2. Combine simulated data of N groups into one pool 
of 2 rows (genes) by Y^dLi n i columns. 

3. Obtain the Pearson correlation coefficient from the 
pool of data (we used the function cor in R [25]). 

4. Repeat steps 1-3 above 1000 times; results are 
presented as averages over 1000 repetitions. 

As a control, we first performed an experiment with 



parameters fi X y,i = (0,0) and 



'xy, 



1 ~ (o i) 



for all 



groups i = 1, 10. This simulation provided nearly zero 



correlation coefficients (—0.004 



'xy 



0.003), thus 



reassuring us that our simulation procedure worked as 
expected. 

Simulation of only mean differences across a pool of N 
groups 

First, we analyzed the case in which heterogeneities across 
N groups of simulated data were due only to mean differ- 
ences, i.e. jjL X y 7 i 7^ fi X y 7 j for i ^ j, but variance-covariances 
remained constant, i.e. H X y,i — ^xy We first simulated the 
case of zero correlation within each group and the effect 
of differing means (by a parameter a) in only two of the 
N groups. The simulation results for the set of parame- 
ters \Jb xy x = (a, 0), /jL xy> 2 = (0, a), —10 < a < 10, and 

fi xy) i = (2,2) for i > 3; £^,; = l)' Hi = 10, ^ = 
0.01 < X < 0.1, V i = 1,7V, and 10 < N < 100 are 
shown in Figure 1. The pooled correlation coefficients r xy 
shown in Figure 1 are positive for a < 2, negative for 
a > 2, and nearly zero for a = 2. Even though data pairs 
in each group were drawn from populations with zero cor- 
relations, r xy ranged from —0.56 to 0.48. Also shown in 
Figure 1 is the non-linear relationship between r xy and a 
as well as between r xy and X. Not surprisingly, coefficients 
r xy increase as X increases. The smooth curves observed 
in Figure 1 (as well as non-linearity of r xy with a and X) are 
explained by the asymptotic formulation of Equation 1 as 
written for the set of population parameters used in this 
simulation case study (Equation 4): 



L xy 



-A 2 a 2 -4A(l-2A)(a-2) 

X(l - 2X)(a - 2) 2 + 4A,(1 - 2A.) + 1 



(4) 



Through Equation 4 one can see that 4A 2 +4A q_2A)+i ^ ^ 
for a = 2 and 0.01 < X < 0.1; x xy > 0 for a < 2 because 
the term — 4A(1 — 2X)(a — 2) > 0 and dominates the term 
—X 2 a 2 < 0; < 0 for a > 2 because — 4A(1 — 2X)(a — 
2) < 0 and -X 2 a 2 < 0. 

Secondly, we simulated the case in which the means 
of a gene pair xy differ for all N groups but the correla- 
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Figure 1 Mean differences across a pool of N groups causes 
spurious correlations. r xy was obtained from combining N groups of 
simulated data; simulation parameters: /x X y ( i = (a, 0), fz xyi2 = (0,a), 

-]0<a< 10,and/x xy ,/ = (2 / 2)for/>3;S^ = 0 1 
n, = 10, ki = X, 0.01 < X < 0.1,V/' = 1,/V,and 10 < N < 100. 



-0.9 < p xy < 0.9, 



tion p xy 7^ 0 assumes the same value within each group. 
The simulation results for the set of parameters ii xy> i = 

(i,N — (i — 1)), Ti X y } i = (^ P f 

m = 10, Xi = X, 0.01 < X < 0.1, for i = 1,7V, and 
10 < N < 100 are shown in Figure 2a. The correlation 
coefficients r xy shown in Figure 2a are always negative, 
ranging from —0.99 < r xy < —0.80, even when the true 
correlation within groups p xy was positive. The asymp- 
totic formulation of Equation 1 as written for the set of 
parameters used in this simulation case study is given in 
Equation 5: 



N 



L xy 



i + ^ 2 EtiEf= i+ iO--;) 



N 



(5) 



Through Equation 5 one can see that x xy assumes values 
of nearly —1 for 10 < N < 100. A visualization of the 
problem considered in our second simulation case study is 
shown (Figure 2b) through a scatterplot of expression data 
of a gene pair xy simulated for 10 groups according to the 
parameters n xy ,i = {h (11 - 0)> Pxy = 0.9, and n t = 50, 
for i = 1, 10. As shown in Figure 2b, even though there 
is a positive trend between the expressions of gene x and 
gene y within each of the 10 groups, the trend between 
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Expression levels of gene X (arbitrary units) 

Figure 2 Mean differences across a pool of N groups causes Simpson's paradox, (a) r xy was obtained from combining N groups of simulated 

. All other simulation 



_ / 1 Pxy \ 
'"'{pxy 1 J 



data; simulation parameter p xy is the true correlation of the pairxy within each group /' = 1, A/ for H X y, 

parameters are as follows: fi xyJ = (/', N - (/' - 1 )), -0.9 < p xy < 0.9, m = 1 0, kj = X, 0.01 < X < 0.1 , for / = 1 , N, and 1 0 < N < 1 00. (b) Scatterplot 
of a pair xy obtained with the simulation parameters: /x xy j = (7,(1 1 - /')), p xy = 0.9 and n,- = 50 for/' = 1, 10 groups. This plot shows clearly that 
even though there is a positive trend within each of the 1 0 groups, the trend across the pool of 1 0 groups is negative (Simpson's paradox). 



expression of the gene pair xy across the pool of all groups gives the asymptotic formulation for the set of parameters 
is negative. used in this simulation case study: 



Simulation of only heteroskedasticity across a pool of N 
groups 

Through simulation, we analyzed the effect of varia- 
tions in Ti X y } i = ( ° x ' 1 ° xy 2 J among N groups, keep- 

ing fjL X y f i = /a constant. Simulation of the case where 
a xy>i = 0 but o 2 xi ^ o 2 x . and crfa # o 2 y . for i # j 
resulted in nearly zero correlation coefficients from the 
pool of N groups (-0.04 < r xy < 0.05). This result 
agrees with Equation 1, which predicts zero correlation 
if covariances and mean differences of a gene pair across 
all groups are zero, even when variances differ across 
groups. 

We performed a simulation experiment in which the 
variance of gene x changes across N groups, but the vari- 
ance of gene y and the correlation between genes x and 
y remain constant across N groups. Results of this exper- 
iment for the set of parameters \i xy ^ = (2,2), Hxyj = 

( G li po *A a l = p, -0.9 < p < 0.9, m = 10, ki = k, 
\pcr X)i 1 J x > 1 

and 0.01 < k < 0.1 for i = 1,N and 10 < < 100 are 
shown in Figure 3. The range of the pooled correlations 
is —0.8 < r xy < 0.8 (shown on the y-axis of Figure 3), 
whereas the range of the true correlations within groups 
is —0.9 < p < 0.9 (shown on the x-axis of Figure 3). 
Figure 3 shows a clear linear relationship between r xy and 
p, in which \r xy \ < \p\. The slope between r xy and p, esti- 
mated through ordinary least squares, is 0.872. Equation 6 




where o x = and = — Equation 6 also 

shows a linear relationship between x xy and p with a slope 
of -% = 0.872, for a xi = i 2 , i = 1,7V, and 10 < N < 100. 

There is also a linear relationship between the average 
of correlation coefficients within each group r xy and p 
(Figure 3), in which -0.889 < r xy < +0.889. In addition, 
as shown in Figure 3, \r xy \ > \r xy \. Hence, it can be easily 
inferred that the mean squared error between r xy and p 
(the true correlation within each group) is greater than the 
mean squared error between r xy and p. Therefore, pools 
of N simulated groups marked by only heteroskedas- 
ticity provide less efficient estimates of the correlation 
across groups than combining the N groups' correlation 
coefficients into an average. 

We analyzed the mean squared error between r xy and 
x xy versus the number of simulated elements in each 
group, for 10 < Yi{ < 100 and N = 20 (data shown in 
Additional file 1). x xy was obtained from plugging popu- 
lation parameters \i xy ^ and T> X y,i m to Equation 1, whereas 
x xy was based on a combination of parameters of each 
group pL xy> i and E^,- (see Equation 7). The mean squared 
error ranged from 0.0004 for nt = 100 to 0.004 for m = 10. 
The correspondence between x xy and x xy was good even 
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Figure 3 Correlation coefficients obtained from a pool of N 
heteroskedastic groups. r xy was obtained from pooling N groups of 
simulated data (shown in red), and 7 xy was obtained from averaging 
within groups correlation (shown in blue); p is the true correlation 
within each group. This plot shows that the error between 7 xy and p is 
smaller than the error between r xy and p. Simulation parameters: 

/ 1 p X y 

flxyj — (2, 2), 2jxy,i = I ^ ^ 

m = 10, A./ = X, and 0.01 < X < 0.1 for/ = 1,/Vand 10 < N < 100. 



-0.9 < p < 0.9, 



for = 10, a small number of simulated elements per 
group. 

Application to experimental microarray data 

Our simulation study showed that Pearson correlation 
coefficients obtained from a pool of data coming from 
groups that have different means are explained solely 
by mean differences across groups. Furthermore, we 
showed that pooling data marked by only heteroskedas- 
ticity provides less efficient estimates of correlation coef- 
ficients than does a classical meta-analysis approach 
of combining correlation coefficients into an average. 
The following analysis of experimental microarray data 
illustrates the results predicted by both theory and 
simulation. 

Example data set 

The example data set of this work includes the raw 
expression data from 522 Affymetrix ATH1 gene chips 
(eel files) from AtGenExpress [27]. Cel files are also avail- 
able from The Arabidopsis Information Resourse (TAIR) 
[27,28]; see Table 1 for the experiments ID on TAIR and 
Additional file 2 for detailed information about treatment 
conditions and number of biological replicates. These 
data come from 10 experiments that explored the effect of 



10 types of abiotic stress on RNA accumulation in shoot 
and root of 16 day-old Arabidopsis thaliana seedlings (see 
Table 1 for details). Experiments followed a 3-factorial 
design with treatment (abiotic stress, control), plant mate- 
rial (root, shoot or seedling), and time post-treatment 
as factors [27]. Seven different research groups located 
at different institutions across Germany performed 
experiments; microarray data were generated at the 
German Resource Center for Genome Research (RZPD) 
(according to experiments description in TAIR [28]). 

Experimental data analysis 

We imported data from eel files into the R environ- 
ment [25] and processed the data with MAS5 from the 
open-source Bioconductor R package affy [29,30]. Fol- 
lowing the methodology described in Horan et al [22], 
we did not screen our example data set for quality 
of biological replicates, and therefore no outliers were 
removed. We followed this procedure because the same 
data from the 10 experiments of AtGenExpress [27] were 
also part of the larger data set used in the work by 
Horan et al. [22]. As described in the methodology of 
Mentzen and Wurtele [21], all data were subsequently 
normalized using the median absolute deviation method 
as performed by the function normalizeBetweenArrays 
(with the option "scale") from the open-source Biocon- 
ductor R package Limma [30,31]. We obtained mean 
values of biological replicates after a log transformation 
(base 2) of the normalized expression data. Because the 
two treatment conditions "genotoxic stress applied to root 
1 hour post-treatment" and "heat control applied to shoot 
24 hours post-treatment" had data for only one biologi- 
cal replicate, their expression measurements were used as 
mean values (refer to Additional file 2 for more details). 
Thereafter, mean values of biological replicates were com- 
bined into one large expression matrix (pooled data) 
encompassing 254 columns and 22,810 rows (correspond- 
ing to probe ids/genes). All but two columns of the large 
expression matrix resulted from averaging data of two or 
three biological replicates (refer to Additional file 2 for 
exact number of biological replicates per treatment condi- 
tion). Gene-to-gene Pearson product-moment correlation 
coefficients (r xy ) were obtained from the large expression 
matrix (pooled data) with the R function cor. [25] . 

Mean differences across a pool of microarray data 

We used estimates of the asymptotic expression given in 
Equation 1 to examine the makeup of Pearson correlation 
coefficients obtained from pooling the means of biolog- 
ical replicates of different experimental conditions into 
one large expression matrix. In order to accomplish this 
task, we classified data in columns of the large expression 
matrix into 19 groups. Each group had gene expression 
values from either root or shoot in each of nine types 
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Table 1 Description of the the example data set 
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Shoot 


12 


24 
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48 
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16 


48 



Each microarray experiment is described by its TAIR [28] id and type of abiotic stress; eel files of each experiment can be located on TAIR [28] through its id. Total 
number of downloaded eel files from each experiment is shown on column Cel files. The combination Abiotic stress/Plant part gives the 1 9 groups used for the 
asymptotic analysis of Pearson correlation coefficients. The column n-, shows number of elements in each of 1 9 groups comprising the large expression matrix. The 
column nf shows number of elements in each of 1 9 groups comprising the large matrix of residuals. 



of abiotic stress treatments (see Table 1 for details). We 
adopted this procedure because an exploratory analysis 
showed clear mean differences in gene levels expressed in 
root or shoot. The light stress experiment was for entire 
seedlings, and our analysis did not show mean differ- 
ences that would justify further division of the data from 
this experiment. Each groups name and its correspond- 
ing number of elements nu for groups i = 1, 19, are 
given in Table 1 (number of elements Yi{ correspond to 
the number of mean expression values of a gene within 
group /). 

Data across 19 groups were obviously not homogeneous 
because each group corresponds to a combination of the 
type of abiotic stress and the plant material, which surely 
would have an effect on the total group mean of a gene 
expression. In addition, data within groups cannot strictly 
be considered homogeneous either because gene mean 
expression values within groups correspond to differ- 
ent time points post application of abiotic stress/control 
treatments (further details about treatment conditions 
inside and across groups is given in Additional file 2). 
Because our exploratory analysis indicated that the grand 
mean expression level of genes within groups seemed 
to dominate over means of all other treatment effects 



(data not shown), we considered data within groups as 
roughly homogeneous. 

We used the procedure described in steps 1 through 4 
below to make a diagnostic of r xy obtained from a pool 
of gene expression data coming from 19 heterogeneous 
groups, where fjL xy>i ^ \± xy j and/or ^ E^, for i ^ j. 

1. For a given gene pair xy, obtain estimates 

frxy,i = (Xbyd and t xy>i = ( S *$' 1 ) for all groups 

\>xy,i *y } i J 

i = 1, 19. Here x\ and yi are, respectively, group 
means of expression levels of genes x and y, and 
Sy •, and Sxyj are group variances and covariances of 
expression levels of genes x and y, respectively. 

2. Estimate asymptotic coefficients r xy as 



S Xy + (l X y ^ 

d x dy 



where 



19 

(8) 

i=l 
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19 19 

dxy = ^2^2 - *j)(yi - yj) (9) 

i—l j>i 

19 19 

4=4 + J2J2 w** - ^-) 2 do) 

i=l )>i 
19 19 

4=*}+££w^-^) 2 ( n > 

i—l j>i 

19 

i=l 
19 

5 2 = (13) 

i=l 

3. Use the residual error (r xy — x xy ) to compare Pearson 
correlation coefficients as obtained from the large 
expression matrix and coefficients as estimated 
through Equation 7, which are based on parameter 
estimates of 19 groups of data. 

4. Small residual errors indicate good agreement 
between r xy and x xy . As a result, the two components 
of Equation 7 (i.e. the weighted average of all 
covariances across 19 groups and the weighted 
average of the cross product of mean differences of a 
gene pair xy across 19 groups) can explain the signs 
and magnitudes of r xy , the Pearson correlation 
coefficient obtained from the large expression matrix. 

Pearson correlation coefficients obtained from the pooled 
expression data 

We first obtained Pearson correlation coefficients for pair- 
wise combinations of all 22,810 genes present in the 
large expression matrix, which resulted in more than 260 
million coefficients. The Pearson correlation coefficients 
ranged from -0.992 to 0.998 with 0.008 as the mean value, 
and coefficients showed a symmetric distribution around 
zero; roughly 10% of these coefficients were greater than 
0.7 or less than —0.7 (data shown in Additional file 1). 

Because all genes present in the large expression matrix 
provide more than 260 million pairwise correlation coef- 
ficients, we used a subset of 500 randomly selected genes 
and all their 124,750 pairwise correlation coefficients to 
illustrate potential problems with gene pairwise corre- 
lation coefficients estimated from a pool of microarray 
data. Pearson correlation coefficients from the pooled 
expression data (r xy ) ranged from —0.979 to 0.990 with 
0.007 as the mean value, and coefficients showed a sym- 
metric distribution around zero; roughly 10% of these 
coefficients were greater than 0.7 or less than —0.7, as 
shown in the histogram of Figure 4. The asymptotic 
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Figure 4 Histogram of 124,750 Pearson correlation coefficients 
obtained from the large expression matrix. 



coefficients i xy , estimated according to Equation 7, ranged 
from -0.978 to 0.989 with 0.007 as the mean value (r xy 
values were obtained through the R function given in 
Additional file 3). The histogram of residual errors (r xy — 
x xy ) (Figure 5a) shows a bimodal distribution in which 
the mean value of negative residual errors is —0.008 and 
the mean value of positive residual errors is 0.008. The 
bimodality of the residual errors implies that \r xy \ > \r xy \. 
In addition, the plot of (r xy — r xy ) 2 versus r xy (Figure 5b) 
shows that residual errors are smaller closer to extreme 
values and reach a maximum around ±0.45. The bimodal- 
ity observed in Figure 5a and the shape observed in 
Figure 5b closely follow the bimodality and shape of the 
bias between the Pearson estimator and the true corre- 
lation of a population p, which is approximately p(l — 
p 2 )/(2n) [32,33]. The bias p(l - p 2 )/(2n) is maximized as 
p assumes a value around ±0.58. 

The analysis of all 124,750 pairwise correlations of 
500 randomly selected genes revealed good agreement 
between r xy and x xyi despite the approximations we made 
about homogeneity of data within groups and the rel- 
atively low number of elements in each group; in our 
example data set 12 < m < 18, whereas Hassler and 
Thadewalds example data set had around 90 elements 
in each of two groups [9]. Therefore, our analysis reas- 
sured us that the Pearson correlation coefficients obtained 
from the large expression matrix can be explained by 
heterogeneities due to different means and variances- 
covariances across the 19 groups we used to classify our 
example data set. 
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Next, we show the influence of each term of Equation 7 
on signs and magnitudes of r xyt the Pearson correlation 
coefficients obtained from the large expression matrix. 
The plot of r xy versus s xy (Figure 6a) shows that r xy ranges 
from —1 to +1 for negative and positive values of s xy . 
Therefore, positive or negative covariances of a gene pair 
within each of the 19 groups have no effect on positive or 
negative correlations estimated from the large expression 
matrix. Conversely, the "S" shape observed in plot of r xy 
versus d xy (Figure 6b) indicates that positive or negative 
mean-differences d xy (Equation 9) of a gene pair across the 
19 groups are the sole determinant of the sign of r xyi i.e. 

> 0 ==> r xy > 0 and d xy < 0 ==> r xy < 0. The magni- 
tude of r xy is due mostly to mean differences because the 

correlation between r xy and is 0.98 (the second term 



of Equation 7), whereas the correlation between r xy and 
j^- is 0.31 (the first term of Equation 7). 

Because \r xy \ > 0.7 obtained from pools of microar- 
ray data has been used as the cut-off value representing 
a strong association between gene pairs [21-23], we com- 
puted the percentage contribution of the covariance and 
mean differences terms on the magnitude of \r xy \ > 0.7, 

i- e - y. ~t h + v d 7 j ^ 1- There were 10,567 correla- 

' xy^x^y ' xy^x^y 

tion coefficients with roughly equal numbers distributed 
in the r xy < —0.7 and r xy > 0.7 categories. The median 

of was 1.98% with 50% of the data showing val- 

ues between 0.04% and 5.32%. Conversely, the median of 
d 7 -j % was 96.93% with 50% of the data showing values 
between 93.51% and 98.98%. 
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Figure 6 Influence of covariances and means of 19 groups on signs of the Pearson correlation coefficients obtained from the pooled 
expression data, (a) s xy = £/ 9 */%,/; (b) d xy = J2il] Ey=/+1 Wj&i ~ ^')(y/ - 
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A combination of correlation coefficients between 
expression profiles within each group, given by r xy = 
Yl]=i ^i r xy,h ranged from —0.6 < r xy < 0.89 with 0.132 as 
the mean value. A direct comparison between r xy and r xy 
showed a correlation coefficient of 0.3. 

By applying the asymptotic theory developed by Hassler 
and Thadewald [9] to the Pearson correlation coefficients 
obtained from the large expression matrix, we showed 
that differences in means across 19 heterogeneous groups 
of data is the main factor determining the magnitude 
and sign of coefficients of 124,750 gene pairs. As previ- 
ously shown by Hassler and Thadewald [9], this result 
corroborates that gene pairwise correlation coefficients 
estimated from a pool of microarray data do not mea- 
sure "the closeness of linear relationship" [34] (p. 177) 
between expressions of a gene pair. Instead, they mea- 
sure the extent of mean differences of a gene pair across 
different groups comprising the pool of data. 

Heteroskedasticity across a pool of microarray data 

Here, we examine the case in which Pearson correlation 
coefficients are obtained from a pool of microarray data 
in which only gene pairwise variances-covariances differ 
across groups of data, i.e. Hxyj 7^ ^xy,j f° r i 7^ /• In this 
case, Equation 1 can be written as 



* P * 

r — v r = 

' xy v xy 



2^=1 k i°x,i 2-^i=l k i G y,i 



(14) 



For instance, heteroskedasticity could occur in a situa- 
tion in which data from completely replicated microarray 
experiments are pooled to be examined as one data set. 
As was reported in the work of Goldstein et al [1], 
data variability could differ substantially across replicated 
microarray experiments. 

In order to attain only heteroskedasticity across the 19 
groups of our example data set, we removed the effect of 
varied experimental conditions on expressions of genes 
within each group. For this purpose, we fitted linear mod- 
els to genes (within each group i = 1, 19) and obtained 
their residuals. Following the methodology for differential 
expression of genes proposed by Smith [35], we modeled 
the expression level of all genes in group /, here repre- 
sented by a matrix Y;, with a systematic treatment effect 
(a linear model represented by Zifa) plus error, i.e. 

Yi = ZiPi + ei 

for i = 1, 19. We assumed that e; ~ N(0, £;), where £; is 
the variance-covariance matrix of all genes in each group 
i = 1, 19. We obtained residuals as 

e i = Y i -Z i p i 

where fa was estimated using the open-source Bioconduc- 
tor R package Limma [30,31]. This approach is equivalent 



to subtracting expression levels of each biological replicate 
from their mean values. We used linear models because 
they are well known by the community who works with 
differential expression of microarrays measurements and 
the process of obtaining their residuals is easy and auto- 
matic through the use of the Limma package [30,31]. 

We combined all gene expression residuals from the 19 
groups into one pool of residuals (a large matrix of resid- 
uals including 520 columns and 22,810 rows). Expression 
levels of the two treatment conditions "genotoxic stress 
applied to root 1 hour post-treatment" and "heat control 
applied to shoot 24 hours post-treatment" could not be 
used in the analysis of residuals because they had only 
one biological replicate (refer to Additional file 2 for more 
details). This explains why the matrix of residuals has 520 
columns instead of 522 columns. We repeated the analy- 
sis described in steps 1-4 from the section "Application 
to experimental microarray data" for the data in the large 
matrix of residuals. 

Pearson correlation coefficients estimated from the pooled 
residuals 

Here we show results of the analysis involving the large 
matrix of residuals (pooled residuals) for the same subset 
of 500 genes used in the analysis of the large expression 
matrix. Pearson correlation coefficients (r* y ) of all 124, 
750 pairwise combinations of 500 genes obtained from 
the large matrix of residuals ranged from —0.553 to 0.849 
with 0.01 as the mean value. Their asymptotic counter- 
parts (f * y ), estimated according to Equation 7, ranged 
from -0.554 to 0.849 with 0.01 as the mean value. The 
combination of covariances within each of the 19 groups, 
i.e. Yl]=i ^i s xy,i> determined the sign of r^ y because all 
pairwise mean differences among groups were zero (data 
shown in Additional file 1). 



We then compared f. 



xy 



to the 



y^El X i s l,i X i s ],i 

weighted average of correlations obtained within each of 
the 19 groups of residuals, i.e. r% y = YaLi ^i r xy/> *%y 
ranged from —0.631 to 0.847 with 0.011 as the mean 
value. In addition, we observed a strong linear relationship 
between r*^ estimated from the large matrix of residuals 
and r* y , with a correlation equal to 0.93. Therefore, the 
Pearson correlation coefficients obtained from the large 
matrix of residuals (whose heterogeneities result from dif- 
ferent variances-covariances across the 19 groups) also 
measure a linear relationship between the expression 
residuals of a gene pair. 

Bias of correlation coefficients obtained across 1 9 groups 
of microarray data 

We provide here a performance metric for the correlation 
coefficients estimated across the 19 groups of microarray 
data by assessing their bias with respect to coefficients 
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within each of the 19 groups. We quantified bias as in 
Equation 15: 

B(p xy ) = ^ X ^y-^ l (15) 

where p xy represents the correlation point estimate of a 
gene pair xy across the 19 groups of microarray data and 
Pxy,i represents its counterparts within each group. 

We evaluated the bias (as defined in Equation 15) of 
each of the 124,750 gene pairs' correlation coefficients that 
were obtained according to: (a) p xy = r xy , the Pearson 
correlation coefficients obtained directly from the large 
expression matrix (pooled data); (b) p xy = r xy , the aver- 
age of correlations between expression profiles within i = 
1, 19 groups comprising the large expression matrix; (c) 
p xy = r* y , the Pearson correlation coefficients estimated 
directly from the large matrix of residuals (pooled residu- 
als); and (d) p xy = r* y , the average of correlations between 
expression residuals within i = 1, 19 groups compris- 
ing the large matrix of residuals. For the large expression 
matrix, p xy ^ = r xy> i is the Pearson correlation coeffi- 
cient between expression profiles within each of 19 groups 
comprising the large expression matrix, whereas for the 
large matrix of residuals, p xy j — r* y j is the Pearson cor- 
relation coefficient between expression residuals within 
each of 19 groups comprising the large matrix of residu- 
als. Table 2 gives the statistical summaries of the values 
obtained for B{r xy ), B(f xy ), B(r* y ), and 

The analysis involving the data in the large expres- 
sion matrix (whose heterogeneities were due to means 
and variances-covariances differences across 19 groups) 
resulted in consistently larger statistical summaries of 
B(r xy ) than did those of B(r xy ). In addition, the maximum 
value of B(r xy ) is twice as much the maximum value of 
B(r xy ) (Table 2). For the large matrix of residuals (whose 
heterogeneities were due only to heteroskedasticity), the 
values of B(r* y ) shown in Table 2 are slightly larger than 
are the values of B(r* y ). 

Moreover, more information can be grasped through 
the visualization of biases versus coefficients, as shown 



in Figures 7a and 7b. The trend shown in the plot of 
B(Txy) versus r xy (Figure 7a), where B(r xy ) increases as \r xy \ 
approaches ±1, is very distinct from that shown in the plot 
of B(r xy ) versus r xy (Figure 7b), where B(r xy ) decreases 
as r xy approaches ±1. Indeed, the mean value B(r xy ) for 
\r xy \ > 0.7 is 0.18, whereas the mean value of B(r xy ) for 
\f xy \ > 0.7 is 0.045. The visualization of biases involv- 
ing the large matrix of residuals showed a roughly random 
pattern between B(r* y ) and \r* y \, as \r* y \ decreases to zero 
(data shown in Additional file 1). The plot of B(r* y ) ver- 
sus r* y showed a pattern similar to the one observed in 
Figure 7b, where B(r* y ) decreases as r*^ approaches ±1 
and both B(r* y ), B(r xy ) show maximum values around 
zero (data shown in Additional file 1). 

The plot of B(r* y ) versus B(r* y ) (Figure 8) reveals 
all data points lying below the diagonal, thus imply- 
ing that Birly) < B(r* y ), V r* y , r* r This result cor- 
roborates that, in the case of only heteroskedasticity 
across the 19 groups of microarray data, the combination 
of correlation coefficients performs better than pooling 
data. 

Discussion and conclusion 

Discussion 

In this study, we performed a comprehensive analysis of 
Pearson correlation coefficients obtained from combining 
data of 19 heterogeneous groups of experimental microar- 
ray data into one large pool. By applying the theory devel- 
oped by Hassler and Thadewald [9] to our example data 
set, we determined the specific effect of mean differences 
and heteroskedasticity across the 19 groups on the sign 
and magnitude of the pooled coefficients. In addition, we 
provided a performance metric for correlation coefficients 
by quantifying their biases. 

We quantified the bias of the correlation coefficient 
of a gene pair through the mean squared error between 
its estimate across a pool of groups and its estimates 
within groups. A similar method has been used by Hunter 
and Schmidt to assess the variance of their meta-analysis 
estimator of the Pearson correlation coefficient across 
independent studies [36]. We evaluated the bias of gene- 
to-gene correlations estimated according to the following 



Table 2 Statistical summaries of biases of correlation coefficients 





Min. 


IstQu. 


Median 


3rdQu. 


Max, 


B(r xy ) 


0.022 


0.078 


0.099 


0.132 


0.308 


B(?xy) 


0.021 


0.062 


0.072 


0.083 


0.153 




0.019 


0.049 


0.057 


0.067 


0.141 




0.018 


0.049 


0.057 


0.065 


0.132 



B{r xy ) - bias of the Pearson correlation coefficients estimated directly from the large expression matrix; 
B(r xy ) - bias of the average of correlations between expression profiles within i = 1, 19 groups; 

B(r* y ) - bias of the Pearson correlation coefficients estimated directly from the pooled residuals; B(r* y ) - bias of the average of correlations between expression 
residuals within i = 1, 19 groups. 
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Figure 7 Assessment of biases of the correlation coefficients estimated from 1 9 groups of expression data. B(p xy ) = V — ^ for (a) 

p xy = r xy , the Pearson correlation coefficients estimated directly from the large expression matrix; (b) p xy = 7 xy , the average of correlations between 
expression profiles within /' = 1,19 groups; p xy j is the correlation between expression profiles within each group. 



two methods: (a) by combining 19 groups of microar- 
ray data into a large pool to be analyzed as a single data 
set (pooled data) and (b) by combining correlation coef- 
ficients of each of 19 groups of microarray data into 
an average weighted by the number of elements in each 
group, which corresponds to the Hunter-Schmidt meta- 
analysis estimator of the Pearson correlation coefficient 
across independent studies [36]. 

The data used in this study came from 10 microarray 
experiments (AtGenExpress Project [27]) carried out by 
seven different laboratories distributed across Germany 



that followed the same experimental protocol; these are 
a subset of the large pool of microarray data found in 
the study of Horan et al. [22]. Experiments followed a 3- 
factorial design with treatment (abiotic stress, control), 
tissue (root, shoot, or seedlings in general), and time post- 
treatment as factors [27]. Mean differences within and 
across experiments were a matter-of-fact because statis- 
tically significant differences in gene expression of sev- 
eral types of abiotic stress versus control treatment were 
reported in Kilian et al.'s study [27]. We expected dif- 
ferences due to variability across experiments to remain 
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Figure 8 Comparison between B(r* y ) and B(r* y ). B(r* y ) is the bias of the Pearson correlation coefficients estimated directly from the pooled 
residuals; B(r* ) is the bias of the average of correlations between expression residuals within /' = 1,19 groups. 
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after removing mean differences because of reported dif- 
ficulties in the reproduction of microarray studies [1]. 
Therefore, homogeneity cannot be ensured across experi- 
ments, and combining the means or residuals of biological 
replicates of the 10 experiments into a large pool as a 
single set is not sound from a statistical viewpoint. 

The analysis of the components of the correlation 
coefficients obtained from the large expression matrix 
corroborated the results predicted by both theory and 
simulation that variances-covariances within the 19 
groups had negligible impact on correlation coefficients, 
but different means across the 19 groups had a deci- 
sive effect on the sign as well as on the magnitude of 
coefficients. Coefficients that were greater than 0.7 or 
less than —0.7 showed the largest range of bias (Table 2). 
Therefore, large values of the pooled coefficients were an 
artifact in the sense that they did not communicate a real 
linear association between the expression profiles of two 
genes; rather, they appeared because the data were com- 
bined into a large pool. For this reason, large values of the 
pooled coefficients are in fact an ecological fallacy [10]. 

We also showed through Monte Carlo simulation that 
the structure of different means across a pool of 10 < 
N < 100 groups could generate Simpsons paradox. In 
our case study simulation shown in Figure 2, we showed 
that even though the correlation within each group 
was +0.9, a pool of N (10 < N < 100) groups provided 
negative correlation coefficients because the combination 
of all pairwise mean differences had a negative sign and 
greater magnitude than the positive covariance of the data 
within groups. Hassler and Thadewald [9] studied Simp- 
sons paradox through the analytical solution of Equation 1 
for N = 2, and showed that the occurrence of mean differ- 
ences with opposite signs in both correlated variables is a 
condition for contradictory results between a correlation 
coefficient that is estimated across or within each of two 
groups. 

We combined residuals from fitting linear models of 
every gene into a large matrix of residuals (22,810 rows 
x 520 columns). Here we departed from the assumption 
of independence (common to the analysis of differentially 
expressed genes [35,37]) and considered a multivariate 
normal distribution for residuals within groups, with a 
mean of zero and variance-covariance T> X y,if i = h 19. 
The large matrix of residuals gave us the opportunity 
to evaluate gene pair correlations estimated from a pool 
of data marked by only heteroskedasticity. Our results 
showed that correlation coefficients estimated across the 
19 groups of residuals were closely related to the variance- 
covariances within groups. We also found a strong linear 
relationship between the Pearson correlation coefficients 
obtained from the large matrix of residuals and the coef- 
ficients resulting from averaging correlation estimates 
within groups. However, the heteroskedasticity of the 



data in the large matrix of residuals resulted in less effi- 
cient estimations of the correlation between a gene pair 
than did the classical meta-analysis approach of combin- 
ing correlation coefficients into an average. These results 
were corroborated by Monte Carlo simulations of only 
heteroskedasticity across N > 2 groups of data. 

The results shown in this study indicate that the com- 
bination of statistical results is best suited for estimating 
correlations of a gene pair across several microarray stud- 
ies. Nevertheless, further studies are necessary to assess 
various methods of combining within-groups gene-to- 
gene correlation coefficients. 

Conclusion 

This study demonstrates three aspects of the importance 
of statistical methods in the synthesis of information 
across microarray experiments: 

(A) Large values of gene-to-gene Pearson correlation 
coefficients estimated from a pool of 19 groups of 
microarray data were an ecological fallacy; the effect 
of heterogeneous means across a pool of data 
overpowers the covariance structure of the data. 

(B) The effect of heterogeneous variance-covariances 
across a pool of data causes less efficient estimates of 
Pearson correlation coefficients across groups of 
microarray data than does the approach of combining 
correlation coefficients of individual groups. 

(C) The combination of statistical results is best suited 
for synthesizing the correlation between expression 
profiles of a gene pair across several microarray 
studies. 
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